ABSTRACT 

Objective (1) To evaluate a state-of-the-art natural 
language processing (NLP)-based approach to 
automatically de-identify a large set of diverse clinical 
notes. (2) To measure the impact of de-identification on 
the performance of information extraction algorithms on 
the de-identified documents. 
Material and methods A cross-sectional study that 
included 3503 stratified, randomly selected clinical notes 
(over 22 note types) from five million documents 
produced at one of the largest US pediatric hospitals. 
Sensitivity, precision, F value of two automated de- 
identification systems for removing all 18 HIPAA-defined 
protected health information elements were computed. 
Performance was assessed against a manually 
generated 'gold standard'. Statistical significance was 
tested. The automated de-identification performance 
was also compared with that of two humans on a 10% 
subsample of the gold standard. The effect of de- 
identification on the performance of subsequent 
medication extraction was measured. 
Results The gold standard included 30 815 protected 
health information elements and more than one million 
tokens. The most accurate NLP method had 91.92% 
sensitivity (R) and 95.08% precision (P) overall. The 
performance of the system was indistinguishable from 
that of human annotators (annotators' performance was 
92.15%(R)/93.95%(P) and 94.55%(R)/88.45%(P) overall 
while the best system obtained 92.91%(R)/95.73%(P) on 
the same text). The impact of automated de-identification 
was minimal on the utility of the narrative notes for 
subsequent information extraction as measured by the 
sensitivity and precision of medication name extraction. 
Discussion and conclusion NLP-based de- 
identification shows excellent performance that rivals the 
performance of human annotators. Furthermore, unlike 
manual de-identification, the automated approach scales 
up to millions of documents quickly and inexpensively. 



This paper studied automated de-identification of 
clinical narrative text using natural language 
processing (NLP)-based methods. The specific aims 
were (1) to evaluate a state-of-the-art NLP-based 
approach to automatically de-identify a large set of 
diverse clinical notes for all HIPAA (Health Insurance 
Portability and Accountability Act)-defined 
protected health information (PHI) elements 
and (2) to measure the impact of de-identification 
on the performance of information extraction (IE) 
algorithms executed on the de-identified documents. 
In addition, we hope that our study, by contrasting the 
performance of human and automated de-identification, 
will shape policy expectations. 

BACKGROUND AND SIGNIFICANCE 

The importance of information included in narrative 
clinical text of the electronic health record (EHR) is 
gaining increasing recognition as a critical component of 
computerized decision support, quality improvement, and 
patient safety. In an August 2011 JAMA editorial, Jha 
discusses the promises of the EHR, emphasizing the 
importance of NLP as an enabling tool for accessing the 
vast information residing in EHR notes. NLP could extract 
information from clinical free-text to fashion decision 
rules or represent clinical knowledge in a standardized 
format. Patient safety and clinical research could also 
benefit from information stored in text that is not 
available in either structured EHR entries or 
administrative data. 

However, the 1996 HIPAA privacy rule requires that before 
clinical text can be used for research, either (1) all PHI 
should be removed through a process of de-identification, 
(2) a patient's consent must be obtained, or (3) the 
institutional review board should grant a waiver of 
consent. Studies have shown that requesting consent 
reduces participation rate, and is often infeasible when 
dealing with large populations. Even if a waiver is 
granted, documents that include PHI should be tracked to 
prevent unauthorized disclosure. On the other hand, 
de-identification removes the requirements for consent, 
waiver, and tracking and facilitates clinical NLP 
research, and consequently, the use of information stored 
in narrative EHR notes. 

Several studies have used NLP for removing PHI from 
medical documents. Rule-based methods make use of 
dictionaries and manually designed rules to match PHI 
patterns in the texts. They often lack generalizability 
and require both time and skill for creating rules, but 
perform better for rare PHI elements. 
Machine-learning-based methods, on the other hand, 
automatically learn to detect PHI patterns based on a set 
of examples and are more generalizable, but require a 
large set of manually annotated examples. Systems using a 
combination of both approaches usually tend to obtain the 
best results. Overall, the best systems report high recall 
and precision, often >90%, and sometimes as high as 99%. 
Nevertheless, no study has evaluated the performance of 
automated de-identification for all PHI classes. Important 
items are often ignored, in particular ages, geographic 
locations, institution and contact information, dates, and 
IDs. Furthermore, systems should ideally be evaluated on a 
large scale, including the diverse document types of the 
EHRs, to have a good idea of their accuracy and 
generalizability. However, most systems use only one or 
two document types for evaluation, such as pathology 
reports, discharge summaries, nursing progress notes, 
outpatient follow-up notes, or medical message boards. 
Some of them were only evaluated on documents with 
synthetic patient PHI (manually de-identified documents 
re-identified with fake PHI). Very few systems have been 
evaluated on more than two note types. 

Only a handful of studies provide details on over-scrubbing 
(non-PHI wrongly identified as PHI) and none of them 
investigate the effect of de-identification on subsequent IE 
tasks. It is indeed possible that de-identification has an 
adverse effect on IE accuracy. First, over-scrubbing errors 
could overlap with useful information: for example, if a 
disease name is erroneously recognized as a person name, it 
will be removed and lost to the subsequent IE application. 
Second, NLP techniques such as part-of-speech tagging and 
parsing may be less effective on modified text. 

In this paper, we examine some of the gaps of the literature 
and conduct de-identification experiments on a large set and 
wide variety of clinical notes (over 22 different types), using real 
PHI data (as opposed to resynthesized data), studying all classes 
of PHI and measuring the impact of de-identification on 
a subsequent IE task. We also illustrate the strength of auto- 
matic de-identification by comparing human and system 
performances. 

MATERIAL AND METHODS 
Data 

Three thousand five hundred and three clinical notes were 
selected by stratified random sampling from five million notes 
composed by Cincinnati Children's Hospital Medical Center 
clinicians during 2010. The study was conducted under an 
approved institutional review board protocol. The notes (see 
descriptive statistics in figure 1) belong to three broad categories 
(with the same proportional distribution as the five million 
notes): 

► Labeled (created within the EHR system and includes division 
origin (eg, emergency department, operating room)) 

► Unlabeled (created within the EHR but no division) 

► External (written outside the EHR (eg, on a radiology system 
and transferred into the EHR through an interface)). 

Within the labeled category, we included 22 note types in 
a stratified random sample. We selected a type only if the 
number of notes exceeded the subjective limit of 800 during the 
previous 12 months. We oversampled discharge summaries 
because of their richness in de-identification information, and 
some of the less common notes to have at least 20 notes for each 
type. Figure 1 shows the distribution of note types in our corpus. 
Including the unlabeled and external notes, the total number of 
note types was above 22. 

All 18 HIPAA-defined PHI categories were included in the 
study. Some of them were collapsed into one category. In total 
we defined 12 classes: 

► NAME 

► DATE (eg, "12/29/2005", "September 15th") 

► AGE (any age, not only age >89) 

► EMAIL 

► INITIALS: person's initials 

► INSTITUTION: hospitals and other organizations 

► IP: internet protocol addresses and URLs 

► LOCATION: geographic locations 

► PHONE: phone and fax numbers 

► SSN: social security number 

► ID: any identification number (medical record numbers, etc) 

► OTHER: all remaining identifiers. 

To create a 'gold standard' for building and evaluating 
systems, clinical notes were manually annotated by two 
annotators (native English speakers with bachelor's degrees). All 
notes were double annotated and the final gold standard 
resulted from consensus-seeking adjudication led by the 
annotators' supervisor. Before production annotation, the 
annotators were trained and the annotation guideline was 
iteratively developed. Double annotation is a standard method 
in NLP because it assures a strong gold standard. We will refer 
to the two annotators who created the gold standard as 
annotator 1 and annotator 2. 

Figure 1 Descriptive statistics of the corpus. DC, discharge; ED, emergency department; H&P, history and physical; OR, operating room. 

Additionally, the 1655 'labeled' notes from the corpus were 
also double-annotated for medication names to test the impact 
of de-identification on the subsequent extraction of medication 
names. 

De-identification systems 

We studied the characteristics of two de-identification systems. 
One, MIST (MITRE Identification Scrubber Toolkit), is a 
prototype from MITRE. The other system was designed in-house 
based on the MALLET machine-learning package. Both 
systems are based on conditional random fields (CRFs), but 
implement the algorithm slightly differently. Using the 
MALLET package to build our system gave us access to the 
algorithm's source code (necessary to obtain probability scores 
for recall-bias experiments), while MIST's source code was not 
available. 

We tested the MIST system in its default configuration, and 
with customizations (preprocessing and postprocessing steps 
and additional features for the CRF model). We also tested two 
configurations of the in-house system, one equivalent to the 
"out-of-the-box" MIST (ie, same feature generation process), and 
one with customizations. 

Before training the customized systems, we performed two 
preprocessing steps: tokenization with an in-house tokenizer 
and part-of-speech tagging with the TreeTagger POS tagger (used 
with its downloadable English model). Features for the CRF 
models consisted of the default features generated by MIST: 
token-level properties (capitalization, punctuation, etc) and 
contextual features (token before, token after, etc). Additional 
features we used were token parts-of-speech and presence (or 
absence) of the tokens in a name lexicon (built using the US 
Census Bureau's dataset and the hospital's physician (employee) 
database). 
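
As an illustration of the feature set described above, the following is a minimal sketch of how such per-token features could be assembled before CRF training. It is not the MIST or MALLET feature generator: the function name `token_features`, the feature names, and the boundary markers `<START>` and `<END>` are hypothetical and chosen only for readability.

```python
def token_features(tokens, pos_tags, i, name_lexicon):
    """Feature dictionary for tokens[i]; all feature names are illustrative."""
    tok = tokens[i]
    return {
        # Token-level properties (capitalization, punctuation, etc)
        "lower": tok.lower(),
        "capitalized": tok[:1].isupper(),
        "has_digit": any(ch.isdigit() for ch in tok),
        "is_punct": bool(tok) and not any(ch.isalnum() for ch in tok),
        # Contextual features (token before, token after)
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "<END>",
        # Additional features: part of speech and name-lexicon membership
        "pos": pos_tags[i],
        "in_name_lexicon": tok.lower() in name_lexicon,
    }


# Hypothetical sentence, tags, and one-entry lexicon.
tokens = ["Seen", "by", "Dr", "Smith", "today", "."]
pos_tags = ["VBN", "IN", "NNP", "NNP", "NN", "."]
features = token_features(tokens, pos_tags, 3, {"smith"})
```

In this sketch the lexicon is a plain set of lower-cased names; a real lexicon built from census and employee data would likely need normalization choices that the text does not specify.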

We also added three postprocessing rules to the 
machine-learning algorithms, consisting of regular expressions 
to (1) identify EMAIL; (2) match strings to the entries of our 
name lexicon, with a match resulting in the assignment of a 
NAME label; and (3) label any string as a NAME if the algorithm 
tagged a matching string NAME in the document but missed the 
particular string somewhere else in the same document. Rule (1) 
was necessary because of the low frequency of EMAILs, which 
made it difficult for the system to learn their patterns. The 
presence of a word in a name lexicon was also used as a feature 
for machine learning, but adding rule (2) as a postprocessing rule 
statistically significantly improved the performance. 
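
The three postprocessing rules can be sketched as follows. This is an illustrative approximation only: the e-mail pattern, the label `O` for non-PHI, the function name `postprocess`, and the decision to apply rules (2) and (3) only to tokens still labeled non-PHI are assumptions, not details given in the text.

```python
import re

# Assumed e-mail pattern; the actual regular expression is not given.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def postprocess(tokens, labels, name_lexicon, non_phi="O"):
    """Apply three illustrative rules to the model's token labels."""
    out = list(labels)
    # Rule 1: a regular expression identifies e-mail addresses.
    for i, tok in enumerate(tokens):
        if EMAIL_RE.fullmatch(tok):
            out[i] = "EMAIL"
    # Rule 2: a token found in the name lexicon is labeled NAME.
    for i, tok in enumerate(tokens):
        if out[i] == non_phi and tok.lower() in name_lexicon:
            out[i] = "NAME"
    # Rule 3: a string the model tagged NAME somewhere in the document
    # is labeled NAME wherever else it occurs in the same document.
    tagged = {t.lower() for t, lab in zip(tokens, labels) if lab == "NAME"}
    for i, tok in enumerate(tokens):
        if out[i] == non_phi and tok.lower() in tagged:
            out[i] = "NAME"
    return out


# Hypothetical document: the model found only the first "Kovacs".
tokens = ["Kovacs", "wrote", "to", "jane@example.org", "about",
          "smith", ".", "kovacs", "agreed"]
labels = ["NAME", "O", "O", "O", "O", "O", "O", "O", "O"]
fixed = postprocess(tokens, labels, {"smith"})
```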

Figure 2 depicts the main steps of the de-identification process 
(identical for both customized systems). 

For convenience, we will refer to the four system versions as 
follows: 

► MIST1: original, "out-of-the-box" MIST system; 

► MIST2: customized MIST system (preprocessing, additional 
features and postprocessing); 

► MCRF1: in-house system with a configuration equivalent to 
MIST1; 

► MCRF2: configuration equivalent to MIST2. 

Experiments 

Evaluation metrics 

We used three standard NLP metrics to measure performance: 
recall (sensitivity), precision (positive predictive value) and 
F value, which is the harmonic mean of recall (R) and 
precision (P) (F=(2*P*R)/(P+R)). We computed those 
metrics at span level (a complete phrase is identified as PHI), 
token level (individual tokens are identified as PHI) and tag-blind 
token level (without taking into account the specific PHI tags). 
Span-level performance was computed for all performance tests. 
Token-level and tag-blind evaluations are provided only for the 
best performing system. 
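
To make the three evaluation levels concrete, the sketch below computes precision, recall and F value on sets of annotations. It assumes, for illustration only, that each annotation is a `(start, end, tag)` triple over token offsets with an exclusive end; the helper names `prf` and `to_tokens` are hypothetical.

```python
def prf(gold, predicted):
    """Precision, recall and F value for two sets of annotations."""
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def to_tokens(spans, tag_blind=False):
    """Expand (start, end, tag) spans into one item per token."""
    return {(i, "PHI" if tag_blind else tag)
            for start, end, tag in spans
            for i in range(start, end)}


# Hypothetical gold standard and system output for one document.
gold = {(0, 2, "NAME"), (5, 6, "DATE")}
pred = {(0, 1, "NAME"), (5, 6, "AGE")}

span_level = prf(gold, pred)                      # exact span and tag
token_level = prf(to_tokens(gold), to_tokens(pred))
tag_blind = prf(to_tokens(gold, True), to_tokens(pred, True))
```

In this toy case the span-level F value is 0 (no span matches exactly), the token-level F value is 0.4, and the tag-blind token-level F value is 0.8, which shows why the more lenient levels yield higher scores.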

To rule out the possibility that the performance difference 
between two systems' outputs was due to chance, we also 
tested the statistical significance of the difference, using 
approximate randomization. 
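
A minimal sketch of an approximate randomization test is given below. It assumes that each system's output is summarized per document as a `(true positives, false positives, false negatives)` triple and that the two systems' outputs are swapped document by document; the number of shuffles, the seed, and all names are illustrative assumptions rather than the settings used in the study.

```python
import random


def f_value(doc_counts):
    """Micro-averaged F value from per-document (tp, fp, fn) counts."""
    tp = sum(c[0] for c in doc_counts)
    fp = sum(c[1] for c in doc_counts)
    fn = sum(c[2] for c in doc_counts)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def approximate_randomization(counts_a, counts_b, trials=9999, seed=0):
    """Two-sided p value for the difference in F value of two systems."""
    rng = random.Random(seed)
    observed = abs(f_value(counts_a) - f_value(counts_b))
    at_least_as_large = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(counts_a, counts_b):
            # Swap the two systems' outputs for this document half the time.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if abs(f_value(shuffled_a) - f_value(shuffled_b)) >= observed:
            at_least_as_large += 1
    return (at_least_as_large + 1) / (trials + 1)


# Hypothetical counts: system A is perfect, system B is poor, on 20 documents.
system_a = [(10, 0, 0)] * 20
system_b = [(1, 9, 9)] * 20
p_value = approximate_randomization(system_a, system_b, trials=999)
```

With this estimator the smallest p value that can be returned is 1/(trials+1), so the number of shuffles bounds the resolution of the test.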

Interannotator agreement (IAA) 

IAA was calculated for the two annotators to define the strength 
of the gold standard, using the F value, after an initial 2-week 
training period. We required both span and tag to be the same for 
an annotated element to be counted as a match. 

Figure 2 De-identification process. CRF, conditional random 
field; PHI, protected health information. 

De-identification performance tests 

We evaluated overall performance (all tags considered) and tag- 
based performance of the MIST and MCRF systems in a 10-fold 
cross-validation setting (the corpus was divided at the document 
level). In addition to the corpus-level test, we also measured the 
de-identification performance for document types. 
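
Dividing the corpus at the document level means that all tokens of a note fall into the same fold, so no note contributes to both training and testing. The sketch below shows one simple way to build such folds; the shuffling seed and the function names are assumptions, and stratification by note type is deliberately left out.

```python
import random


def document_folds(doc_ids, k=10, seed=0):
    """Shuffle document ids and deal them into k folds."""
    ids = sorted(doc_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def cross_validation_splits(doc_ids, k=10, seed=0):
    """Yield (train, test) lists of document ids, one pair per fold."""
    folds = document_folds(doc_ids, k, seed)
    for i, test in enumerate(folds):
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


# The corpus in this study has 3503 notes; the ids here are placeholders.
splits = list(cross_validation_splits(range(3503)))
```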

A separate subset of 250 annotated documents (not part of 
either the training or testing) was manually examined during 
error analyses (development set). 

Additionally, we measured the performance of MCRF2 
on two publicly available datasets: the i2b2 corpus, which 
consists of de-identified discharge summaries (669 reports 
for training and 220 reports for testing) that have been 
resynthesized with fake PHI; and the PhysioNet corpus, 
which consists of 2483 nursing notes with very sparse PHI 
elements (1779 in total). For the PhysioNet corpus, we report 
performance using a cross-validation setting. 

Humans versus systems performance tests 

We conducted an experiment to compare the performance of the 
automated systems with that of humans. Two native English 
speakers (with master's and bachelor's degrees) who had not 
previously taken part in the project annotated (independently) 
a random subset of 10% of the corpus (350 documents). We 
evaluated their individual performance against our gold stan- 
dard. We will refer to the two additional annotators as annotator 
3 and annotator 4. 

Recall bias 

In de-identification processes, recall is usually more important 
than precision, so we experimented with infusing recall bias into 
both systems. For MIST, we used the built-in command line 
parameter that implements Minkov's algorithm. For the 
MCRF system, we increased recall by selecting tokens labeled 
non-PHI and changing their label to the PHI label with the next 
highest probability suggested by the system. We selected 
non-PHI labels only if their system-generated probability score 
was less than or equal to a given threshold (eg, if we set the 
probability threshold at 0.95, every non-PHI label with a score 
>0.95 retained the original label). The threshold was varied 
between 0.85 and 0.99. In general, the higher we set the 
threshold, the more non-PHI tokens we selected and replaced, 
leading to higher recall. 
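
The relabeling step for the MCRF system can be sketched as follows, assuming that the model exposes, for every token, a probability for each candidate label. The label `O` for non-PHI, the function name `apply_recall_bias`, and the example probabilities are hypothetical.

```python
def apply_recall_bias(token_probs, threshold=0.95, non_phi="O"):
    """Relabel low-confidence non-PHI tokens with their best PHI label.

    token_probs: one dict per token, mapping label -> probability.
    """
    labels = []
    for probs in token_probs:
        best = max(probs, key=probs.get)
        if best == non_phi and probs[best] <= threshold:
            # Fall back to the PHI label with the next highest probability.
            phi_only = {lab: p for lab, p in probs.items() if lab != non_phi}
            if phi_only:
                best = max(phi_only, key=phi_only.get)
        labels.append(best)
    return labels


# Hypothetical per-token probabilities for a three-token document.
token_probs = [
    {"O": 0.97, "NAME": 0.03},                # confident non-PHI: kept
    {"O": 0.90, "NAME": 0.08, "DATE": 0.02},  # below 0.95: becomes NAME
    {"NAME": 0.60, "O": 0.40},                # already PHI: unchanged
]
biased = apply_recall_bias(token_probs, threshold=0.95)
unbiased = apply_recall_bias(token_probs, threshold=0.85)
```

Raising the threshold relabels more tokens, which matches the behavior described above: recall increases at the expense of precision.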

Impact of de-identification on subsequent IE 

The impact was tested by measuring the performance of auto- 
mated IE on medication names (a subset of the corpus was 
annotated for medication names, as mentioned in the 'Data' 
subsection). We extracted medication names from clinical notes 
(1) before removing PHI (system trained and tested on original 
corpus), (2) after removing and replacing PHI with asterisks 
(system trained and tested on the corpus with asterisks), and (3) 
after removing and replacing PHI with synthetically generated 
PHI surrogates (system trained and tested on corpus with 
synthetic PHI). In the evaluation of medication IE, if, for 
example, the medication name "aspirin" was erroneously tagged as 
NAME and removed from the corpus, then it was counted as a 
false negative for IE. 

Footnote: We did not evaluate MIST on the i2b2 and PhysioNet 
corpora because (1) the two systems are very similar and (2) MIST 
was already evaluated on the i2b2 corpus (its F value ranked 
first in the i2b2 challenge). 

We used MIST's built-in functionality to replace the original 
PHI with synthetic PHI. For medication name extraction, we 
used an automated system being developed in-house. 

RESULTS 

Corpus descriptive statistics 

The corpus included at least 22 different note types, and more 
than one million tokens (see figure 1). Figure 3 shows the 
number of annotated PHI elements. Almost 50% are located in 
discharge summaries and progress notes. This lopsided distri- 
bution is due to the fact that these note types generally are the 
longest. More than 30% of all PHI was found in discharge 
summaries, confirming findings of Aberdeen et al. 

DATE comprised more than one-third of all PHI, and NAME 
about a quarter. The third largest category was the mixed group 
of OTHER. Not shown in the figures are categories with 
extremely low frequencies: EMAIL (frequency: 14), INITIALS 
(16), IP (10), and SSN (1). 

Interannotator agreement 

The overall F value of IAA was 91.76 for manual 
de-identification between annotators 1 and 2 (see top part of 
figure 4). The IAA for manual medication name annotation 
was 93.51 (1655 "Labeled" notes were annotated for 
medications). These values indicate good agreement for both the 
de-identification and the subsequent medication name 
extraction annotations. 

Automated de-identification performance 

Table 1 (upper section) presents the performance of the 
de-identification systems for each tag type and overall, for the 
"out-of-the-box" systems (MIST1 and MCRF1) and customized 
systems (MIST2 and MCRF2). For five of the eight PHI 
tags shown, and for overall F value, MCRF2 achieved the highest 
performance. The difference between the two customized 
systems was found to be statistically significant for AGE, 
OTHER, ID, NAME, and overall F values (see lower section of 
table 1). For each tag level and overall F value, the custom- 
izations increased performance of both systems. This increase 
was statistically significant for NAME and overall F values for 
MCRF2 and for AGE, PHONE, DATE, NAME, and overall 
F values for MIST2. 

Table 1 also shows token-level performance for the best 
system (MCRF2). Compared with span level, the token-level 
performance gains range from <0.1% (DATE) to approximately 
18% (LOCATION). Tag-blind token-level performance is even 
higher, with an overall F value of 95.93. 

Table 2 gives the F values obtained by MCRF2 for each 
document type. Performance varies between the different note 
types, although high performance (>90%) is achieved for the 
majority of notes. 

Overall token-level performance of MCRF2 on the i2b2 
corpus was 96.68% F value (99.18% precision, 94.26% recall) 
with our default configuration and 97.44% F value (97.89% 
precision, 97.01% recall) using our recall bias method 
(threshold of 0.91). These results are similar to those obtained 
by the top systems in the i2b2 challenge and slightly lower 
than the performance of MIST (98.1% F value, 98.7% precision, 
97.5% recall, as reported in Uzuner et al; however, our system 
was not customized for the i2b2 dataset). Performance on the 
PhysioNet corpus was much lower: 70.60 F value [...] 
bias method (0.97 threshold). This is explained by the very low 
frequency of PHI (1779) in the PhysioNet corpus, which makes 
this corpus ill-suited for machine-learning methods (there are 
not enough training instances). In that case, a rule-based 
method such as the one used by the providers of the corpus 
will have higher performance (74.9% precision, 96.7% recall 
and 84.41% F value). Gardner et al also evaluated their CRF 
algorithm on the PhysioNet corpus and observed a large 
performance drop: they obtained a 25.5% precision for a 97.2% 
recall (40.04 F value) and a 70% precision for an 80% recall 
(74.66 F value). 

Figure 3 Number of annotated protected health information (PHI) elements for each document type. DC, discharge; ED, emergency department; H&P, history and physical; OR, operating room. 

Figure 4 Inter-annotator agreement (IAA; F value) for each protected health information (PHI) class on the entire gold standard (annotators 1 and 2) and on the 10% common sample (annotators 1, 2, 3, and 4). 

Human de-identification performance 

Table 3 shows the performance of the humans compared 
with that of the customized systems on the 10% random 
subset. Both humans performed worst when identifying PHI 
in the OTHER category. Performance of humans and systems 
is close, especially for AGE, DATE, and ID, where statistical 
tests found no significant difference (lower part of table 3). 
Both systems performed significantly better than the two 
humans on OTHER and better than annotator 3 on INSTI- 
TUTION. They both performed worse on LOCATION. 
Humans achieved better performance than the systems on 
NAME and better than MIST2 on PHONE. Both humans 
obtained a lower overall F value than the systems, but the 
highest recall was obtained by annotator 4. Figure 5 visualizes 
the F values obtained by the four systems and the two 
annotators. 

For each tag level and overall F value, the difference between 
each human and the gold standard was statistically significant 
(lower section of table 1), as was the difference between each 
system and the gold standard. 

We also computed IAA between the four humans on the 350 
documents they all annotated (bottom part of figure 4). IAA is 
high between all annotator pairs for AGE, DATE, NAME, 
PHONE categories, and overall. It is low for OTHER, and fluc- 
tuates between the various pairs for IDNUM, INSTITUTION, 
and LOCATION. 



Recall bias 

Changing the command line value parameter (MIST)" and the 
threshold of non-PHI labels (in-house system) resulted in 
varying levels of recall changes. Figure 6 shows the results of the 
experiments for overall performance. The recall variation is 
rather limited on both systems. After a certain point, it reaches 
its maximum and then even decreases slightly, owing to the 
increasing number of non-PHI elements that are erroneously 
collapsed with true PHI. The maximum recall is 93.58 for 
MIST2 (bias parameter value of -3) and 93.66 for MCRF2 (0.93 
threshold). 

Impact of de-identification on subsequent IE 

The impact of de-identification on the subsequent extraction of 
medication names is negligible. Results are shown in table 4, 
with statistical significance tests. The performance is slightly 
higher on de-identified text (including manually de-identified), 
but the difference is significant at the p<0.05 level only for two 
de-identified corpora. If Bonferroni correction is considered 
(because of the multiple comparisons), then none of the differ- 
ences are significant. 
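
The effect of the Bonferroni correction can be illustrated with a short sketch. The p values below are hypothetical placeholders, not the values reported in table 4; the function name is likewise illustrative.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag comparisons that remain significant after Bonferroni correction."""
    cutoff = alpha / len(p_values)  # each test is held to alpha / m
    return [p < cutoff for p in p_values]


# Hypothetical p values for four comparisons against the original corpus.
p_values = [0.03, 0.04, 0.20, 0.60]
uncorrected = [p < 0.05 for p in p_values]
corrected = bonferroni_significant(p_values)
```

In this example two comparisons fall below 0.05 before correction, but none falls below the corrected cutoff of 0.0125, which mirrors the pattern described in the text.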

DISCUSSION 

We performed error analysis for the best system on the 
development set (350 documents with 3845 PHI). The system made 
476 errors. Of these, 13% (62) were boundary detection errors: 
partially tagged PHI (eg, only "5/12" in "Monday 5/12") or PHI 
including extra tokens (eg, in "Fax" followed by a fax number, 
"Fax" was also tagged). A further 24.2% (115) were false 
positives, although 26.1% (30) of them were actually PHI but 
were labeled as the wrong category (eg, "Rochester NY" tagged 
as NAME instead of LOCATION). Ten of the false-positive results 
were true positives missing from the gold standard (missed by 
annotators 1 and 2). This happened for the NAME, ID, DATE, and 
OTHER categories. For NAME, a majority of false positives were 
device names (eg, "Sheehy" in "Sheehy tube") or capitalized 
words (eg, "Status Asthmaticus"). For DATE, scores and 
measurements that 


"MIST is set to have a slight recall bias (-1) out-of-the-box. 
Table 1 Performance of systems (per-tag and overall precision (P), recall (R) and F value (F)) and statistical significance tests 

Performance of systems (10-fold cross-validation) 

       MIST1                 MCRF1                 MIST2                 MCRF2
       P      R      F       P      R      F       P      R      F       P      R      F
AGE    [?]    [?]    [?]     [?]    [?]    [?]     [?]    [?]    [?]     [?]    [?]    93.22
DATE   [?]    [?]    [?]     [?]    [?]    [?]     [?]    [?]    [?]     [?]    96.98  97.46
ID     90.58  92.64  91.6    97.23  95.64  96.43   91.17  93.38  92.26   97.27  95.7   96.48
INST   90.59  86.41  88.45   93.18  85.01  [?]     90.61  87.06  88.8    93.25  85.26  89.08
LOC    79.82  67.93  73.4    86.12  68.94  76.58   78.92  69.95  74.16   87.38  69.95  77.7
NAME   93.19  [?]    [?]     [?]    [?]    [?]     [?]    [?]    [?]     [?]    [?]    94.52
OTH    [?]    [?]    [?]     [?]    [?]    [?]     [?]    [?]    [?]     [?]    [?]    78.91
PH     90.06  93.04  91.52   94.08  90.64  92.33   91.44  93.95  92.68   94.42  90.87  92.61
All    91.02  [?]    [?]     [?]    [?]    [?]     [?]    [?]    [?]     95.08  91.92  93.48

[?] marks a value that is not recoverable from the source. Four further values survive without a recoverable column position: 96.44, 97.29 and 97.76 (AGE or DATE rows) and 93.31 (NAME row). 

Token-level performance for best system (MCRF2) 

       Token-level             Token-level + tag-blind
       P      R      F         P      R      F
AGE    98.21  93.40  95.75            93.42
DATE   98.17  96.87  97.52            96.98
ID     97.57  95.49  96.52            96.43
INST   97.48  92.74  95.05            94.79
LOC    97.95  93.92  95.89            96.03
NAME   97.26  97.38  97.32            97.53
OTH    86.81  76.45  81.30            78.31
PH     97.13  93.40  95.23            94.85
All    96.68  93.77  95.20     97.42  94.49  95.93



Statistical significance tests between F values obtained by systems (cross-validation evaluation) 

       MCRF1 vs MIST1   MCRF2 vs MIST2   MIST1 vs MIST2   MCRF2 vs MCRF1   MCRF2 vs gold standard
       p Value          p Value          p Value          p Value          p Value
AGE    0.8490           *0.0087          *0.0389          0.4922           *0.0001
DATE   *0.0001          0.7650           *0.0001          0.2470           *0.0001
ID     *0.0001          *0.0001          0.0996           0.8856           *0.0001
INST   0.3572           0.6087           0.2738           0.6623           *0.0001
LOC    0.0777           0.0553           0.5897           0.3812           *0.0001
NAME   0.4897           *0.0001          *0.0001          *0.0001          *0.0001
OTH    *0.0071          *0.0180          0.3676           0.6118           *0.0001
PH     0.2936           0.9248           *0.0458          0.6208           *0.0001
All    *0.0001          *0.0001          *0.0001          *0.0001          *0.0001

*Indicates statistical significance (p<0.05). 

INST, institution; LOC, location; OTH, other; PH, phone. 



looked like dates (eg, pain scores such as "2/10") were 
often wrongly tagged. Finally, 62.82% (299) of the errors 
were missing PHI, although 9% (27) of those had been tagged 
but with the wrong category. Not counting the mislabeled 
elements, the system missed 38 NAMEs (out of 952), 3 of 
124 IDs, 32 of 1744 DATEs, 8 of 164 PHONEs, 27 of 209 AGEs, 
23 of 186 INSTITUTIONs, 3 of 56 LOCATIONs, and 138 
of 410 OTHERs. The majority of false negatives (58.9%) 
were single-token elements (eg, single first names were more 
often missed by the system than first names followed by last 
names). 

There are many take-home messages in our experiments that 
we believe should influence the decisions of institutional 
review boards about whether to accept the output of auto- 
mated de-identification systems as comparable to manual de- 
identification. First, no single manual de-identification is 100% 
accurate. Even the results of double manual de-identification 
are not perfect. We found statistically significant differences 
between the gold standard that was the result of an adjudi- 
cated double de-identification and the output of the individual 
annotators. Consequently, evaluations that are based on a 
single-annotated standard could misjudge the automated system's 




performance. Second, different note types have a different 
density of PHI (and potentially different context of the same 
PHI), and a de-identification system that is trained on a mix of 
note types will show varying performance on these note types. 
As a result, the de-identification performance of machine- 
learning systems will depend on the frequency of PHI types in 
the training data. High performance was achieved for most 
note types in our corpus, so we believe a single system can 
work for multiple note types if the training corpus includes the 
particular note type in sufficient number or if the PHI elements 
of a note type are expressed in similar ways as in other note 
types. Finally, installing a high-performance MIST-based 
prototype automated de-identification system is 
straightforward. It involves a few hours of setup. Annotating the gold 
standard requires additional effort and its extent depends on 
multiple factors (eg, frequency of PHI in notes). The amount of 
annotations required to achieve high performance varies 
among the different PHI classes, depending on the variability 
of their form and context. For instance, we observed that 
PHONEs (which have regular patterns) and IDs (which 
occurred in easily identifiable contexts, eg, following "MRN:") 
only required a couple of hundred annotations to achieve good 

J Am Med Inform Assoc 2013;20:84-94. doi:10.1136/amiajnl-2012-001012 



Research and applications 



Table 2 Best system (MCRF2) performance (F values) per document type 

Document type           AGE     DATE    ID      INST    LOC     NAME    OTH     PH      All PHI
Asthma action plan      100     96.59   [?]     96.47   100     95.17   98.11   69.72   92.19
Brief OpNote            100     [?]     [?]     [?]     100     [?]     [?]     100     [?]
Communication body      95.24   93.07   75      82.35   80      [?]     80.99   50      92.11
Consult note            [?]     [?]     [?]     [?]     [?]     [?]     [?]     [?]     [?]
DC summaries            [?]     [?]     [?]     [?]     [?]     [?]     [?]     [?]     [?]
ED medical student      96.77   69.23   100     66.67   [?]     95.45   96.7    100     92.13
ED notes                [?]     80      100     72.73   100     72.73   [?]     [?]     65.84
ED provider notes       [?]     [?]     100     [?]     [?]     [?]     [?]     [?]     [?]
ED provider reassess.   100     100     100     100     100     [?]     92.31   100     92.86
H&P                     [?]     [?]     [?]     [?]     [?]     [?]     [?]     [?]     [?]
Med student             [?]     98.22   [?]     69.44   50      [?]     76.06   100     92.93
Operative report        [?]     [?]     100     100     100     [?]     [?]     100     [?]
OR nursing              100     100     100     100     100     [?]     [?]     100     81.82
Patient instructions    100     94.74   100     62.5    66.67   87.91   55      90.62   81.51
Pharmacy note           [?]     91.2    100     100     100     72      65.31   100     88.01
Plan of care note       100     96.97   100     50      [?]     [?]     78.12   66.67   87.35
Pre-op evaluation       100     [?]     100     100     100     [?]     100     100     [?]
Procedure note          100     [?]     100     [?]     [?]     [?]     [?]     100     [?]
Progress notes Outp     [?]     [?]     [?]     [?]     [?]     [?]     [?]     [?]     [?]
Progress notes Inp      93.23   98.19   93.88   87.91   54.55   93.32   76.64   98.41   94.75
Referral                100     100     100     90.91   0       100     100     100     93.75
Telephone encounter     100     91.94   100     56.25   40      88.21   47.62   76.36   82.16
All labeled notes       92.9    97.51   97.04   90.17   73.02   94.75   78.68   91.54   93.54
Unlabeled notes         92.44   97.44   97.07   87.68   83.65   94.68   80.09   95.92   93.58
External notes          94.87   96.79   69.57   63.89   70.59   85.31   60.71   82.22   91.88
All notes               93.22   97.46   96.48   89.08   77.7    94.52   78.91   92.61   93.48

[?], value not legible in the source. 

Zero F value is the consequence of insufficient representation of a particular PHI type in that particular note category (eg, if there was one Location PHI element in 20 notes and it was missed 
then the F value was zero). 

DC, discharge; ED, emergency department; H&P, history and physical; OR, operating room; PHI, protected health information. 



performance (>90% F values), while the mixed category of 
OTHER could not reach such high performance even with 
a couple of thousand annotations. 
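
The zero F values in Table 2 follow directly from the definition of the measure. The sketch below is an illustration only. It assumes that the F value is the balanced harmonic mean of precision and recall, which is consistent with the overall figures reported in the paper (95.08% precision and 91.92% sensitivity give about 93.47). The function name `f_value` and the counts in the second call are hypothetical and are not taken from the study.

```python
def f_value(tp: int, fp: int, fn: int) -> float:
    """Return the balanced F value (in percent) from true positives,
    false positives and false negatives.

    Assumption: F = 2PR / (P + R); undefined ratios are treated as 0.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 100 * 2 * precision * recall / (precision + recall)


# Case described in the Table 2 footnote: a single Location element
# exists in the note category and the system misses it.
print(f_value(tp=0, fp=0, fn=1))

# Hypothetical counts for a well-represented PHI type.
print(round(f_value(tp=9, fp=1, fn=1), 2))
```

With a single gold-standard element in a category, one miss leaves no true positives, so the F value is 0 regardless of how the system performs on other PHI types.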



In addition, of interest for the translational research 
community, we found that automated de-identification did not 
reduce the accuracy of subsequent IE. The performances of the 



Table 3 Performance of humans versus automated systems (per-tag and overall precision (P), recall (R) and F value (F)) 



Statistical significance tests between F values obtained by humans versus systems (• indicates statistical significance (p<0.05); Anno3=annotator 3; Anno4=annotator 4). 
INST, institution; LOC, location; OTH, other; PH, phone. 
Figure 5 F values obtained by the systems and the humans for each PHI class. 
MCRF, Mallet conditional random field; MIST, MITRE 
Identification Scrubber Toolkit. 



automated de-identification systems were sufficiently high that 
over-scrubbing errors did not affect the value of the de-identified 
corpus for extracting medical information. 

Our results have several limitations. First, de-identification 
performance for the LOCATION and OTHER categories should be 
improved. Second, a larger sample is needed to evaluate 
performance properly for EMAIL, IP, SSN and INITIALS. Third, the 
corpus was obtained from only one institution, although it did 
include over 22 different note types selected from more than five 
million notes. Fourth, we should experiment with at least one more 
subsequent NLP task, because the impact of de-identification 
might differ for another task. Finally, the prototype needs to be 
transferred to a production environment to adequately estimate 
the cost of setting up a hospital's automated de-identification 
system. 

CONCLUSION 

In this paper, we presented a large-scale study on automated 
de-identification of clinical text, including over 3500 notes of 
more than 22 types. We showed that two automated 
systems, an existing system (MIST) and an in-house system, 
could obtain high performance (93.48% span-level and 95.20% 
token-level overall F values for the best system). We also 
compared results of the systems with those obtained by two 
human annotators and found that the performance of the 
systems rivaled that of the humans, with the humans even 


Figure 6 Recall variations obtained by 
adjusting MIST's bias parameter and 
using thresholds for Mallet CRF 
probability scores (customized 
systems). CRF, conditional random 
field; MIST, MITRE Identification 
Scrubber Toolkit. 
performing slightly worse on a couple of PHI categories and 
overall. Furthermore, unlike manual de-identification, the 
automated approach scales up to millions of documents quickly and 
inexpensively. Finally, this study also goes beyond 
de-identification performance testing by looking at the effect of 
de-identification on a subsequent IE task (medication extraction), 
for which no decrease in performance was seen. 
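
The recall adjustment summarized in Figure 6 for the Mallet CRF system relies on thresholds applied to probability scores. The sketch below illustrates the general idea only, under the assumption that each token carries a model probability of being PHI. The scores, gold labels and the function name `recall_precision_at` are hypothetical and do not reproduce the study's implementation or data.

```python
from typing import List, Tuple


def recall_precision_at(
    threshold: float, scores: List[float], gold: List[bool]
) -> Tuple[float, float]:
    """Tag a token as PHI when its score reaches the threshold and
    return (recall, precision) against the gold labels."""
    predicted = [s >= threshold for s in scores]
    tp = sum(p and g for p, g in zip(predicted, gold))
    fp = sum(p and not g for p, g in zip(predicted, gold))
    fn = sum((not p) and g for p, g in zip(predicted, gold))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical per-token PHI probabilities and gold-standard labels.
scores = [0.95, 0.80, 0.40, 0.30, 0.10, 0.05]
gold = [True, True, True, False, True, False]

# Lowering the threshold tags more tokens as PHI.
for t in (0.5, 0.2, 0.08):
    r, p = recall_precision_at(t, scores, gold)
    print(f"threshold={t}: recall={r:.2f}, precision={p:.2f}")
```

In this toy example, recall rises from 0.50 to 1.00 as the threshold is lowered, while precision falls from 1.00 to 0.80, which is the kind of trade-off a recall-oriented de-identification setting accepts.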